Xtractor: A light wrapper for XML paragraph-centric documents
Abstract
The emergence of XML has driven the development of applications centered on XML documents. These documents often contain tagged paragraphs of natural-language text. Extracting relevant data from such paragraphs is difficult because their structure is irregular and hidden in the text, so powerful extraction patterns are required. Although a wide range of wrappers has been designed, mainly to process HTML pages, these wrappers cannot deal with semi-structured data and still do not take natural language processing into account. In this paper, we present a specification language that lets casual users write expressive yet simple extraction patterns in a regular-expression style. Moreover, we introduce Xtractor, which relies on linguistic parsing of paragraphs and applies technical and natural-language dictionaries.
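To make the idea of regular-expression-style extraction over tagged paragraphs concrete, here is a minimal Python sketch. It is not Xtractor's specification language, which the abstract does not show. The sample document, the `para` tag, the `PRICE` pattern, and the `extract` helper are all hypothetical, and the sketch performs no linguistic parsing or dictionary lookup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: tags delimit narrative text paragraphs.
SAMPLE = """<report>
  <para id="p1">The pump was installed in 2001 and costs 450 euros.</para>
  <para id="p2">No maintenance was recorded for this unit.</para>
</report>"""

# Illustrative pattern written as a plain Python regex (an assumption,
# not Xtractor syntax): a price mentioned inside free text.
PRICE = re.compile(r"costs\s+(?P<amount>\d+)\s+(?P<currency>euros|dollars)")


def extract(xml_text, pattern, tag="para"):
    """Return one dict per match: the paragraph id plus the named groups."""
    root = ET.fromstring(xml_text)
    results = []
    for para in root.iter(tag):
        # Join all text nodes so inline child tags do not split the paragraph.
        text = "".join(para.itertext())
        for match in pattern.finditer(text):
            results.append({"para": para.get("id"), **match.groupdict()})
    return results


records = extract(SAMPLE, PRICE)
print(records)
```

A purely lexical pattern like this one breaks as soon as the wording varies, which is the limitation that the linguistic parsing and dictionaries described in the abstract are meant to address.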
Related resources
Study of the Automatic Construction of XML Documents Models from a Relational Data Model
Capturing information from end users remains a sensitive challenge, especially when the information takes the form of documents. The difficulty lies in indexing the information so that it can be queried precisely. In the DRUID project, the end-user captures XML paragraph-centric documents (i.e. documents with tags delimiting narrative text paragraphs), and a transformation tool generates XML data...
NATIVE XML DATABASES vs. RELATIONAL DATABASES IN DEALING WITH XML DOCUMENTS
When dealing with data-centric XML documents, it is possible to convert XML documents into a relational database, which can then be queried using SQL. Such relational databases are called XML-enabled databases. On the other hand, the best choice for storing, updating and retrieving document-centric XML documents is usually a native XML database (NXD). NXDs store XML documents as logical units, ...
Metaheuristic clustering of Persian XML documents based on structural and content similarity
Given the growing number of XML documents, organizing them effectively so that useful information can be retrieved from them is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
RelAndXML: a system to manage XML-based course material with object-relational databases
In this thesis, we present our newly developed system RelAndXML for the management and storage of hypertext-centric XML documents and the corresponding XSL stylesheets. Our sample application area is university course material. Typically, course material is reused across multiple assignments, while it is also important to be able to add or replace questions. Currently, teaching assistants use differen...
Statistical Language Models for Intelligent XML Retrieval
The XML standards that are currently emerging have a number of characteristics that can also be found in database management systems, like schemas (DTDs and XML schema) and query languages (XPath and XQuery). Following this line of reasoning, an XML database might resemble traditional database systems. However, XML is more than a language to mark up data; it is also a language to mark up textua...